Astronomical Data Analysis Software and Systems XX 
ASP Conference Series, Vol. 

I. N. Evans, A. Accomazzi, D. J. Mink and A. H. Rots, eds. 
©2011 Astronomical Society of the Pacific 



The True Bottleneck of Modern Scientific Computing in Astronomy 

Igor Chilingarian 1,2 and Ivan Zolotukhin 3,2 

1 CDS, Observatoire astronomique de Strasbourg, Université de Strasbourg, 
CNRS UMR 7550, 11 rue de l'Université, 67000 Strasbourg, France 

2 Sternberg Astronomical Institute, Moscow State University, 13 Universitetsky 
prospect, Moscow, 119992, Russia 

3 Observatoire de Paris, LERMA, UMR 8112, 61 Av. de l'Observatoire, 75014 
Paris, France 

Abstract. We discuss what hampers the rate of scientific progress in our exponentially 
growing world. Technology advances rapidly, yet metrics of research output lag far 
behind. We argue that the reason lies in the education of astronomers, which lacks the 
basic computer science that is crucial in the era of data-intensive science. 



1. Motivation 

Present-day astronomical instruments and large surveys produce a data flow that 
increases exponentially with time. The CPU power required to analyse these data is 
growing at the same pace, following Moore's law; the same applies to the data storage 
volume per unit price. However, in astronomy we do not see an exponential avalanche 
of scientific results produced with this computational power. This suggests the 
presence of a bottleneck somewhere in the loop: if we consider a system containing three 
modules "A", "B", and "C" such that "A" is connected to "C" via "B", then optimizing 
module "A" or "C" will not change the performance of the system until the performance 
problems in module "B" are addressed. 
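
The three-module argument can be made concrete with a toy model. The sketch below 
assumes a serial chain in which the sustained throughput is set by the slowest stage; 
the function name and the rates are illustrative assumptions, not measurements. 

```python
def pipeline_throughput(rates):
    """Sustained throughput of a serial chain: the slowest stage sets the pace."""
    return min(rates.values())


# Hypothetical rates in arbitrary units of "data sets per day".
rates = {"A": 100.0, "B": 5.0, "C": 80.0}
baseline = pipeline_throughput(rates)

# Speeding up "A" or "C" tenfold changes nothing while "B" is the bottleneck.
faster_a = pipeline_throughput({**rates, "A": 1000.0})
faster_c = pipeline_throughput({**rates, "C": 800.0})

# Doubling "B" doubles the throughput of the whole system.
faster_b = pipeline_throughput({**rates, "B": 10.0})

print(baseline, faster_a, faster_c, faster_b)
```

In this simplified picture, faster instruments and cheaper CPUs correspond to modules 
"A" and "C", while the software written to connect them plays the role of "B". 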

Where is the true bottleneck of scientific computing? Astronomers, like many 
other scientists, prefer to develop their computational codes and software systems 
(including database solutions) themselves, often without coding skills and with 
insufficient background in algorithms and computational science. 



2. Code Writing: Astronomers vs Software Engineers 
2.1. Scientific Software by Scientists 

Most computer programs developed by astronomers without a computer science 
background, regardless of their purpose (numerical modelling or simulations, data 
reduction or visualisation, etc.), share some specific features: 

(a) They are usually written in Fortran-95, -90, -77 (or even prehistoric Fortran-4 and 
-66). Sometimes high-level languages (e.g. IDL, MATLAB) are used. Primitive build 
scripts are used instead of Makefiles or more advanced build solutions (e.g. Ant or 
Maven for Java). The code is non-portable. 

(b) They often contain a goto statement every 10-20 lines; variable names do not 
follow any convention, e.g. a1, a2, aa1; the code is unreadable, with missing or 
inconsistent indentation and very long function bodies and/or source files. File names, 
device names and file system paths are frequently hard-coded. 

(c) They are undocumented and full of "intuitive" algorithmic solutions, such as "re- 
invented" sorting and search algorithms, which sometimes end up quite far from what 
computer science students learn at school. 

(d) A "multi-layered" code structure is another typical feature. When the author 
returns to the same program after several months or years, he/she often finds that the 
existing procedures/functions do not satisfy his/her needs, but is unwilling to modify 
them in order to keep backward compatibility. A wrapper routine is then created 
which calls some underlying procedures/functions in a slightly different way. As a 
result, after several such periods of development, one can find multiple (undocumented) 
interfaces to the same functional blocks. 

(e) Nevertheless, in the end the program does what it is supposed to do, because the 
author knows exactly what it should do, even though it may sometimes crash at run time 
or perform very poorly. 
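
Item (c) can be illustrated with a small sketch. It assumes a hypothetical hand-rolled 
exchange sort and linear search of the kind described above and contrasts them with 
Python's built-in sorted function and the standard bisect module; the function names 
and the random sample values are invented for the illustration. 

```python
import bisect
import random


def homemade_sort(values):
    """Hypothetical re-invented exchange sort: O(n^2) comparisons."""
    data = list(values)
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if data[j] < data[i]:
                data[i], data[j] = data[j], data[i]
    return data


def homemade_search(sorted_values, target):
    """Hypothetical linear scan: O(n) per lookup, ignores the ordering."""
    for index, value in enumerate(sorted_values):
        if value == target:
            return index
    return -1


def library_search(sorted_values, target):
    """Binary search with the standard bisect module: O(log n) per lookup."""
    index = bisect.bisect_left(sorted_values, target)
    if index < len(sorted_values) and sorted_values[index] == target:
        return index
    return -1


random.seed(42)
magnitudes = [round(random.uniform(10.0, 25.0), 3) for _ in range(500)]

by_hand = homemade_sort(magnitudes)
by_library = sorted(magnitudes)  # built-in sort, O(n log n)

target = by_library[123]
print(by_hand == by_library)
print(homemade_search(by_hand, target) == library_search(by_library, target))
```

Both routes give the same answers; the difference is in the scaling with the number of 
elements and in the amount of code that has to be written, tested and maintained. 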

2.2. Scientific Software by IT Engineers 

The software developed by IT engineers in research is notably different. Here the 
quality of the final product strongly depends on the work of the project manager. 

(a) Usually it is done using a "real" programming language: C/C++/Java, primarily 
because it is virtually impossible to find an IT professional developing in Fortran. 

(b) All necessary solutions for computational algorithms are conventional because the 
developer has at least heard of "The Art of Computer Programming" (Knuth 1978). 

(c) The code is usually well organized and structured; correct indentation and variable 
naming conventions are used; sometimes the author follows one of the established coding 
styles (e.g. GNU). The code is therefore readable and comprehensible. 

(d) The quality and completeness of the documentation strongly depend on the project 
manager's competence. It can range from nonexistent to nearly perfect. 

(e) However, the author often does not understand the physical principles behind the 
algorithm or the particular features of the instrumentation that make the data look 
the way they do, so unpleasant surprises are possible. For example, an arithmetic bug 
that makes the results wrong by many orders of magnitude cannot be spotted by a 
software engineer, because for him/her it is "just a number". This may dramatically 
slow down the development. 
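
The "just a number" problem in item (e) can be mitigated by encoding domain knowledge 
as explicit range checks. The sketch below is an assumption about how such a guard 
could look: the function name, the example fluxes and the accepted magnitude range are 
hypothetical, while the zero point of 3631 Jy is the standard AB magnitude convention. 

```python
import math

AB_ZERO_POINT_JY = 3631.0  # flux density of a zero-magnitude source in the AB system


def ab_magnitude(flux_jy, plausible_range=(-30.0, 40.0)):
    """Convert a flux density in Jy to an AB magnitude and reject absurd values.

    The plausible range is a hypothetical, deliberately loose bracket.
    """
    if flux_jy <= 0.0:
        raise ValueError(f"non-positive flux density: {flux_jy!r} Jy")
    magnitude = -2.5 * math.log10(flux_jy / AB_ZERO_POINT_JY)
    low, high = plausible_range
    if not low <= magnitude <= high:
        raise ValueError(f"implausible magnitude {magnitude:.1f} for {flux_jy!r} Jy")
    return magnitude


good = ab_magnitude(3.631e-6)  # a faint source of 3.631 microJy

# A units slip (a value in erg/s/cm^2/Hz passed as Jy, i.e. off by a factor 1e23)
# is "just a number" to the converter unless a range check catches it.
try:
    ab_magnitude(3.631e-6 * 1e-23)
    caught = False
except ValueError:
    caught = True

print(round(good, 3), caught)
```

A check of this kind lets the code itself flag a result that an astronomer would 
recognise as absurd at a glance. 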

2.3. Databases by Scientists 

The worst class of software solutions is probably DBs developed by researchers. 

(a) Often they contain custom implementations, in Fortran or IDL, of re-invented 
indexing solutions and primitive data queries. Indices and data tables are stored in a 
proprietary undocumented binary format. 

(b) If an existing database management system (DBMS) is used, then the DB usually 
contains one or several flat tables without mutual links, i.e. no data model. 

(c) DB constraints are not used for consistency checks; in some rare cases the checks 
are implemented externally in a DB management interface (also often written in Fortran). 

(d) User interfaces, both the application programming interface (API) and the web 
front-end, are undocumented and have very low usability and terrible design. 
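
Items (b) and (c) contrast with what even a small relational schema provides. The 
sketch below uses Python's built-in sqlite3 module as a stand-in for a full DBMS; the 
table names, column names and values are hypothetical. It links two tables through a 
foreign key and lets the database itself reject inconsistent rows. 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Hypothetical two-table data model: every spectrum must belong to a known galaxy.
conn.executescript("""
CREATE TABLE galaxy (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    ra   REAL NOT NULL CHECK (ra  >= 0.0   AND ra  < 360.0),
    dec  REAL NOT NULL CHECK (dec >= -90.0 AND dec <= 90.0)
);
CREATE TABLE spectrum (
    id        INTEGER PRIMARY KEY,
    galaxy_id INTEGER NOT NULL REFERENCES galaxy(id),
    exptime_s REAL NOT NULL CHECK (exptime_s > 0.0)
);
""")

conn.execute("INSERT INTO galaxy (id, name, ra, dec) VALUES (1, 'Galaxy A', 10.0, 20.0)")
conn.execute("INSERT INTO spectrum (galaxy_id, exptime_s) VALUES (1, 1800.0)")


def rejected(sql):
    """Return True if the DBMS refuses the statement as inconsistent."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A declination of 123 degrees violates the CHECK constraint.
bad_coordinate = rejected("INSERT INTO galaxy (name, ra, dec) VALUES ('X', 10.0, 123.0)")
# A spectrum pointing to a non-existent galaxy violates the foreign key.
orphan_spectrum = rejected("INSERT INTO spectrum (galaxy_id, exptime_s) VALUES (99, 600.0)")
print(bad_coordinate, orphan_spectrum)
```

Declaring the rules in the schema means that every client, whatever language it is 
written in, is subject to the same consistency checks. 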



3. Bad and Good Examples 

For obvious reasons we will not cite the corresponding references for bad examples. 
The list of good examples is not exhaustive. 

3.1. Bad Example #1: an Unnamed Galaxy Catalogue 

The project is very interesting scientifically and recognised in the community. But... 

(a) There is no access interface on the web. 

(b) The data are distributed as a set of dozens of FITS tables with a total volume >10 GB, 
together with IDL routines to query these tables. One has to download nearly 
everything in order to study just a handful of objects. 

(c) Memory requirements are therefore huge if one uses the whole catalogue at once. 

(d) Data access and selection are therefore very slow and inefficient. 
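
The access pattern behind items (b) to (d) can be contrasted with a selection performed 
by the DBMS. The sketch below again uses sqlite3 as a stand-in; the catalogue is filled 
with random numbers, and the table, column and function names are hypothetical. Only 
the rows inside a small coordinate box are returned to the caller, and an index on one 
coordinate can spare the DBMS a scan of the full table. 

```python
import random
import sqlite3

random.seed(1)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalogue (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)")
conn.execute("CREATE INDEX idx_catalogue_ra ON catalogue (ra)")

# Hypothetical catalogue filled with random positions and magnitudes.
rows = [
    (random.uniform(0.0, 360.0), random.uniform(-90.0, 90.0), random.uniform(12.0, 24.0))
    for _ in range(20000)
]
conn.executemany("INSERT INTO catalogue (ra, dec, mag) VALUES (?, ?, ?)", rows)
conn.commit()


def select_box(ra_min, ra_max, dec_min, dec_max):
    """Return only the objects inside a rectangular box (no RA wrap-around handling)."""
    query = (
        "SELECT id, ra, dec, mag FROM catalogue "
        "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ?"
    )
    return conn.execute(query, (ra_min, ra_max, dec_min, dec_max)).fetchall()


handful = select_box(150.0, 155.0, 0.0, 10.0)
total = conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
print(len(handful), "of", total, "rows transferred")
```

With a web or API front-end built on such queries, a user interested in a few objects 
would transfer a few rows rather than the whole data set. 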

3.2. Bad Example #2: an Unnamed Database Using PostgreSQL 

(a) The DB administration and ingestion interface (implemented in Fortran) has a 
function with over 250 arguments. 

(b) Inside the DB restore script, to delete a record from a table, instead of 

DELETE FROM table1 WHERE field1 = value1; 

the authors do: 

pg_dump -t table1 mydb | grep -v value1 | pg_restore -c mydb 

(c) One of the stored procedures, which is triggered on INSERT, connects externally to 
the same DB and makes some selections using this new connection. Obviously, it cannot 
see the changes introduced before the trigger fired, because the transaction has not 
yet been committed. 
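
Items (b) and (c) can be reproduced in miniature. The sketch below uses sqlite3 with a 
temporary file as a stand-in for PostgreSQL, so locking details differ from the system 
described above; the file name and helper function are hypothetical, and table1, field1 
and value1 follow the example in item (b). It removes a record with a single 
parameterised DELETE and then shows that a second connection does not see a row 
inserted in a transaction that has not been committed yet. 

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")  # hypothetical scratch database


def count_rows(connection):
    """Count the rows of table1 as seen through the given connection."""
    return connection.execute("SELECT COUNT(*) FROM table1").fetchall()[0][0]


writer = sqlite3.connect(path)
writer.execute("CREATE TABLE table1 (field1 TEXT)")
writer.executemany("INSERT INTO table1 VALUES (?)", [("value1",), ("value2",)])
writer.commit()

# (b) One statement deletes exactly the matching record; no dump and restore needed.
writer.execute("DELETE FROM table1 WHERE field1 = ?", ("value1",))
writer.commit()
remaining = [row[0] for row in writer.execute("SELECT field1 FROM table1").fetchall()]

# (c) The INSERT below opens a transaction on `writer` that is not yet committed.
writer.execute("INSERT INTO table1 VALUES (?)", ("value3",))
own_view = count_rows(writer)  # the owning connection sees its pending change

# A second, external connection to the same database still sees the old state.
reader = sqlite3.connect(path)
seen_before_commit = count_rows(reader)

writer.commit()
seen_after_commit = count_rows(reader)

print(remaining, own_view, seen_before_commit, seen_after_commit)

reader.close()
writer.close()
```

A procedure that needs to see the pending changes has to run its queries inside the 
transaction that made them rather than through a new external connection. 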



3.3. Good Examples #1: Technologically Advanced Projects 



1. HLA - the Hubble Legacy Archive (http://hla.stsci.edu/). Innovative solutions 
implemented inside HLA include: (a) Virtual Observatory standard interfaces 
(Simple Image Access Protocol) as a hidden middleware; (b) XSLT transformation of 
VOTables into AJAX-enabled HTML pages; (c) advanced visualisation tools. 

2. SDSS CasJobs (http://cas.sdss.org/CasJobs; Szalay et al. 2002). Efficient 
and easy-to-use access to a large DB featuring user management, user table upload, I/O 
of tabular data in different formats, and a comprehensive SQL query builder. 

3. GalexView (http://galex.stsci.edu/GalexView/) - a Flash-based interactive 
web interface to the GALEX satellite images. 

4. Millennium Simulation (http://www.mpa-garching.mpg.de/millennium/ by 
G. Lemson) - access to the DB containing the results of large cosmological simulations 
with a comprehensive data model and full SQL access. 

5. GalMer (http://galmer.obspm.fr/; Chilingarian et al. 2010) - a DB to access 
numerical simulations of merging and interacting galaxies. The project implements a 
set of Virtual Observatory (VO) standards and features efficient interactive preview 
visualisation of the datasets on the server side and complex on-the-fly data analysis 
algorithms. The JavaScript-powered web interface, which works in most modern browsers, 
is integrated with VO tools in order to visualise complex datasets (Chilingarian & 
Zolotukhin 2008; Zolotukhin & Chilingarian 2008). 



3.4. Good Examples #2: Computations, Data Analysis and Visualisation 

1. GADGET-2 by V. Springel (2005) - a cosmological simulation code that is well 
documented and easily extensible: there are numerous third-party add-ons implementing 
different physical phenomena, e.g. radiative transfer, metallicity evolution in galaxies. 

2. SExtractor (Bertin & Arnouts 1996) - software for object extraction and 
photometry from CCD images. It has a very intuitive configuration, although its 
documentation is outdated, and it is relatively clearly coded. 



3. TOPCAT/STILTS (Taylor 2005, 2006) - the best available platform-independent 
table manipulation software integrated with VO services and resources. 

4. CDS Aladin (Bonnarel et al. 2000) - a VO data browser for images and catalogues. 

5. SAOImage DS9 (Joye & Mandel 2003) - probably the most frequently used desktop 
FITS visualisation software in astronomy, implementing some VO data access methods. 



4. The Main Message and a Possible Solution 

It turns out that all "good examples" were developed either by professional astronomers 
with a very strong IT/CS background or by IT/CS professionals who have worked closely 
with astronomers for years and understand astronomy. One cannot simply hire an industrial 
software engineer to develop astronomical software and/or an archive and/or a database. 

A possible solution is to change the teaching paradigm for students in astronomy. 
Basic courses in algorithms, programming, software development and maintenance 
have to be made mandatory in the education of modern astronomers and physicists; 
advanced courses should be recommended to some of them. The Fortran language is 
now obsolete and we have to accept this. Instead of teaching research students Fortran 
programming, one should teach them how to interface legacy Fortran code from C/C++. 

As soon as this bottleneck is resolved, the avalanche of discoveries will follow. 